ABSTRACT

The large-scale analysis of scholarly artifact usage is con- 
strained primarily by current practices in usage data archiv- 
ing, privacy issues concerned with the dissemination of usage 
data, and the lack of a practical ontology for modeling the 
usage domain. As a remedy to the third constraint, this 
article presents a scholarly ontology that was engineered to
represent those classes for which large-scale bibliographic
and usage data exist, to support usage research, and to scale
in its instantiation to the order of 50 million articles
along with their associated artifacts (e.g. authors and journals)
and an accompanying 1 billion usage events. The real
world instantiation of the presented abstract ontology is a 
semantic network model of the scholarly community which 
lends the scholarly process to statistical analysis and com- 
putational support. We present the ontology, discuss its 
instantiation, and provide some example inference rules for 
calculating various scholarly artifact metrics. 

Categories and Subject Descriptors 

I.2.4 [Knowledge Representation Formalisms and Methods]:
Semantic Networks; H.3.7 [Digital Libraries]: Standards -
ontologies

General Terms 

Ontologies, Scholarly Communication 

Keywords 

Resource Description Framework and Schema, Web Ontol- 
ogy Language, Semantic Networks 

1. INTRODUCTION 

New publications are added to the scholarly record at an
accelerating pace. This point is realized by observing the
evolution of the number of publications indexed in Thomson
Scientific's citation database over the last fifteen years:
875,310 in 1990; 1,067,292 in 1995; 1,164,015 in 2000; and
1,511,067 in 2005. However, the extent of the scholarly
record reaches far beyond what is indexed by Thomson
Scientific. While Thomson Scientific focuses primarily on
quality-driven journals (roughly 8,700 in 2005), it does not
index more novel scholarly artifacts such as preprints
deposited in institutional or discipline-oriented repositories,
datasets, software, and simulations that are increasingly
being considered scholarly communication units in their own
right.

While the size (and growth) of the scholarly record is 
impressive, the extent of its use is even more staggering. 
For instance, in November 2006, Elsevier's ScienceDirect,
which provides access to articles from approximately 2,000
journals, celebrated its 1 billionth full-text download since
counting started in April of 1999 (see footnote 1). And, again,
the extent of scholarly usage clearly reaches far beyond
Elsevier's repository. Furthermore, usage events include not only full-text
downloads, but also events such as requesting services from 
linking servers, downloading bibliographic citations, email- 
ing abstracts, etc. 

To a large extent, the effect of usage behavior on the scholarly
process is a horizon that is only beginning to be understood
and, if properly studied, will offer clues to the evolutionary
trends of science [1] [2] [3], quantitative models of
the value of scholarly artifacts [4] [5], and services to support
scholars [6]. The Andrew W. Mellon funded MESUR (see footnote 2)
project at the Research Library of the Los Alamos National
Laboratory aims at developing metrics for assessing scholarly
communication artifacts (e.g. articles, journals, conference
proceedings, etc.) and agents (e.g. authors, institutions,
publishers, repositories, etc.) on the basis of scholarly
usage. In order to do this, the MESUR project makes use 
of a representative collection of bibliographic, citation and 
usage data. This data is collected from a wide variety of
sources including academic publishers, secondary publishers,
institutional linking servers, etc. Expectations are that
the collected data will eventually encompass tens of millions
of bibliographic records, hundreds of millions of citations,

1 Elsevier's 1 billion downloads article available at:
http://www.info.sciencedirect.com/news/archive/2006/news_billionth.asp

2 MEtrics from Scholarly Usage of Resources available at:
http://www.mesur.org/



and billions of usage events. Mining such a vast data set
in an efficient, performant, and flexible manner presents
significant challenges regarding data representation and data
access. This article presents the OWL ontology [7] used
by MESUR to represent bibliographic, citation and usage
data in an integrated manner. The proposed MESUR ontology
is practical, as opposed to all-encompassing, in that
it represents those artifacts and properties that, as previously
shown in [6], are realistically available from modern
scholarly information systems. This includes bibliographic
data such as author, title, identifier, and publication date, as
well as usage data such as the IP address of the accessing agent,
the date and time of access, type of usage, etc. Finally, another
novel contribution of this work is the hybrid storage and 
access architecture in which relational database and triple 
store technology are combined. This is achieved by storing 
core data and relationships in the triple store and auxiliary 
data in a relational database. This design choice is driven 
by the need to keep the size of the triple store to a level that 
can realistically be handled by current technologies. The
combination of the data architecture and scholarly ontology
presented in this article provides the foundation for the
large-scale modeling and analysis of scholarly artifacts and
their usage.

2. SEMANTIC NETWORK ONTOLOGIES 

A semantic network (sometimes called a multi-relational
network or multi-graph) is composed of a set of nodes
(representing heterogeneous artifacts) connected to one another by
a set of qualified, or labeled, edges [8]. In a graph theoretic
sense, a semantic network is a directed labeled graph.
Because an edge is labeled, two nodes can be connected to one
another by an infinite number of edges. However, in most
cases, the possible interconnections between node types are
constrained to a predetermined set. This predetermined set
is made explicit in the semantic network's associated ontology.
An ontology is generally defined as a set of abstract
classes, their relationship to one another, and a collection of
inference rules for deriving implicit relationships [9]. An
ontology makes no explicit reference to the actual instances of
the defined abstract classes; this is the role of the semantic
network.

An ontology is related to the developer's API in object ori- 
ented programming languages such as C++ and Java (minus 
the explicit representation of class methods/functions). For 
example, the set of relationships of an ontological class are 
known as the class' properties and, in the object oriented 
lexicon, can be understood as class fields. Also, a taxonomy
is usually expressed in a semantic network ontology. A
taxonomy of sub- and super-classes supports the inheritance
of class properties. For instance, if all mammals are warm
blooded, then all humans are warm blooded because all hu- 
mans are mammals. In an inheritance hierarchy, the warm 
blooded property of mammals is inherited by all sub-classes 
of mammal (e.g. human). 
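
The inheritance example above can be made concrete with a minimal Python sketch. It is an illustration only, not part of the MESUR ontology: the dictionaries, the warmBlooded property name, and the Lizard class are hypothetical stand-ins for a taxonomy and its declared properties.

```python
# Hypothetical toy taxonomy: sub-class -> super-class.
SUBCLASS_OF = {"Human": "Mammal"}

# Properties declared directly on a class (names are illustrative).
DECLARED = {"Mammal": {"warmBlooded"}}


def inherited_properties(cls):
    """Collect properties declared on cls and on all of its super-classes."""
    props = set()
    while cls is not None:
        props |= DECLARED.get(cls, set())
        cls = SUBCLASS_OF.get(cls)  # None once the top of the taxonomy is reached
    return props


print(inherited_properties("Human"))   # Human inherits from Mammal
print(inherited_properties("Lizard"))  # unrelated class inherits nothing
```

Walking up the sub-class chain is the same mechanism that lets an ontology declare a property once on a general class and have it apply to every specialization.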

Figure 1 diagrams the relationship between an ontology
and its semantic network instantiation. The circles represent
objects that are instances of the abstract classes (the
squares) to which their dash-dot edges point. The three lower
squares are subclasses of a more general top-level class
(denoted by the dashed edges). The horizontal edges in the
ontology denote permissible property types in the instantiation
and thus, corresponding horizontal labeled edges in the semantic
network may exist. Figure 1 does not expose the range of
conceptual nuances that can be expressed by modern ontology
languages and thus, only provides a rudimentary representation
of the relationship between an ontology and its semantic
network instantiation.

Figure 1: The relationship between an ontology and
its semantic network instantiation



2.1 Semantic Network Technology 

The most popular semantic network representational framework
is the Resource Description Framework and Schema,
or RDF(S) [10]. RDF(S) represents all nodes and edges
by Uniform Resource Identifiers (URI) [11]. The URI approach
supports the use of namespacing such that the URI
http://www.science.org#Article has a different meaning,
or connotation, than what may be understood by the
URI http://www.newspaper.net#Article.

The Web Ontology Language (OWL) is an extension of
RDF(S) that supports a richer vocabulary (e.g. promotes
many set theoretical concepts) [7]. Protégé is perhaps the
most popular application for designing OWL ontologies [12].
While OWL is primarily a machine readable language, an
OWL ontology can be diagrammed using the Unified Modeling
Language's (UML) class diagrams (i.e. entity relationship
diagrams).

Modern semantic network data stores represent the
relationship between two nodes by a triple. For instance, the
triple

(URI_a, http://xmlns.com/foaf/0.1/#knows, URI_b)

states that the resource identified by URI_a knows the
resource identified by URI_b, where URI_a and URI_b are nodes
and http://xmlns.com/foaf/0.1/#knows is a directed
labeled edge (see Figure 2). The meaning of knows is fully
defined by the URI http://xmlns.com/foaf/0.1/. The
union of instantiated FOAF triples is a FOAF semantic network.

URI_a --- http://xmlns.com/foaf/0.1/#knows ---> URI_b

Figure 2: A diagrammed triple

Current platforms for storing and querying such semantic
networks are called triple stores. Many open source
and proprietary triple stores currently exist. Various querying
languages exist as well [13]. The role of the query language
is to provide the interface to access the data contained
in the triple store. This is analogous to the relationship
between SQL and relational databases. Perhaps the most
popular triple store query language is SPARQL [14]. An
example SPARQL query is

SELECT ?x
WHERE { ?x foaf:knows vub:cgershen . }

In the above query, the ?x variable is bound to any node
that is the subject of a triple with an associated predicate of
http://xmlns.com/foaf/0.1/#knows and an object of
http://homepages.vub.ac.be/~cgershen. Thus, the
above query returns all people who know vub:cgershen
(i.e. Carlos Gershenson).
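
To illustrate how such a query pattern is evaluated, the following minimal Python sketch matches a single triple pattern against an in-memory set of triples. It is not how a production triple store is implemented; the example.org people and the helper names match and who_knows are hypothetical, and a variable is simply represented by None.

```python
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/#knows"
CGERSHEN = "http://homepages.vub.ac.be/~cgershen"

# Hypothetical triples; the two example.org people are made up.
TRIPLES = {
    ("http://example.org/alice", FOAF_KNOWS, CGERSHEN),
    ("http://example.org/bob", FOAF_KNOWS, CGERSHEN),
    (CGERSHEN, FOAF_KNOWS, "http://example.org/alice"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None plays the role of a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def who_knows(triples, person):
    """Equivalent of: SELECT ?x WHERE { ?x foaf:knows <person> . }"""
    return sorted(t[0] for t in match(triples, p=FOAF_KNOWS, o=person))


print(who_knows(TRIPLES, CGERSHEN))
```

Only the subject position is left unbound, so the result contains the nodes that point to the given person and excludes the triple in which that person is the subject.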

The ontology plays a significant role in many aspects of
a semantic network. Figure 3 demonstrates the role of the
ontology in determining which real world data is harvested,
how that data is represented inside of the triple store
(semantic network), and finally, which queries and inferences
can be executed.



Figure 3: The many roles of an ontology



3. SCHOLARLY ONTOLOGIES 

In general, an ontology's classes, their relationships, and
inferences are determined according to what is being modeled,
what problems that model is trying to solve, and
how that model's classes can be instantiated according to
real world data. Thus, there were three primary requirements
for the development of the MESUR ontology:

1. realistically available real world data 

2. ability to study usage behavior 

3. scalability of the triple store instantiation. 

Without real-world data, an ontology serves only as a con- 
ceptual tool for understanding a particular domain and, in 
such cases, ontologies of this nature may be very detailed 
in what they represent. However, for ontologies that are 
designed to be instantiated by real world data, the ontol- 
ogy is ultimately constrained by data availability. Thus, the 
MESUR ontology is constrained to bibliographic and usage 
data since these are the primary sources of scholarly data. 
In the scholarly community, while articles, journals, confer- 
ence proceedings, and the like are well documented and rep- 
resented in formats that lend themselves to analysis, other 
information, such as usage data, tends to be less explicit due 
to the inherent privacy issues surrounding individual usage 
behavior. Therefore, a primary objective of the MESUR 
project is the acquisition of large-scale usage data sets from 
providers world-wide. 

The purpose of the MESUR project is to study usage behavior
in the scholarly process and therefore, usage modeling
is a necessary component of the MESUR ontology. Given
both usage and bibliographic data, it will be possible to
generate and validate metrics for understanding the 'value' of
all types of scholarly artifacts. Currently, the scholarly
community has one primary means of understanding the value
of a journal and thus its authors: the ISI Impact Factor
[15]. With a semantic network data structure that includes
not only article (and thus, journal) citation, but also
authorship, usage, and institutional relationships, new metrics
that not only rank journals, but also conferences, authors,
and institutions will be created and validated.

Finally, the proposed ontology was engineered to handle
an extremely large semantic network instantiation (on
the order of 50 million articles with a corresponding 1 billion
usage events). The MESUR ontology was engineered
to make a distinction between required base-relationships
and those that, if needed, can be inferred from the
base-relations. Furthermore, because the MESUR
ontology was developed to support the large-scale analysis
of usage, many of the metadata properties such as article
title or author name are not explicitly represented in the
ontology. As will be demonstrated, such data can
be accessed outside the triple store by reference to a
relational database.

4. RELATED WORK 

Other efforts have produced and exploited scholarly on- 
tologies, but they do not cover the needs of the MESUR 
project for two primary reasons. First, they generally lack 
the integration of publication, citation and usage data, which 
MESUR requires in order to represent and analyze these cru- 
cial stages of the public scholarly communication process. 
Second, scalability appears to not have been a major con- 
cern when designing the ontologies and thus, instantiating 
them at the order of what MESUR will be representing is 
unfeasible. Sometimes, the ontology is too elaborate, adding 
complexity that rarely pays off for the simple reason that it 
is hard to realistically come by data to populate defined 
properties (e.g. detailed author or affiliation information). 
Other times, the ontology requires the storage of informa- 
tion that cannot realistically be represented for vast data 
collections using current triple store technologies. 

Several scholarly ontologies are available in the DAML
Ontology Library. While they focus on bibliographic
constructs, they do not model usage events. The same is true of
the Semantic Community Web Portal ontology [16], which,
in addition, maintains many detailed classes whose
instantiation is unrealistic given what is recorded by modern
scholarly information systems.

The ScholOnto ontology was developed as part of an effort
aimed at enabling researchers to describe and debate,
via a semantic network, the contributions of a document,
and its relationship to the literature [17]. While this
ontology supports the concept of a scholarly document and a
scholarly agent, it focuses on formally summarizing and
interactively debating claims made in documents, not on
expressing the actual use of documents. Moreover, support for
bibliographic data is minimal whereas support for discourse
constructs, not required for MESUR, is very detailed.

The ABC ontology [18] was primarily engineered as a common
conceptual model for the interoperability of a variety
of metadata ontologies from different domains. Although
the ABC ontology is able to represent bibliographic and usage
concepts by means of constructs such as artifact (e.g.
article), agent (e.g. author), and action (e.g. use), it is
designed at a level of generality that does not directly support
the granularity required by the MESUR project.

An interesting ontology-based approach was developed by
the Ingenta MetaStore project [19]. Unfortunately, again,
the Ingenta ontology does not support expressing usage of
scholarly documents, which is a primary concern in MESUR.
Nevertheless, the approach is inspiring because Ingenta faces
significant challenges regarding scalability of the
ontology-based representation, storage and access of their
bibliographic metadata collection, which covers approximately
17 million journal articles. However, the scale of the MESUR
data set is several orders of magnitude larger, calling for
optimizations wherever possible. For example, given the MESUR
project's focus on usage, storing bibliographic properties
(author names, abstract, titles, etc.) in the triple store, as
done by Ingenta, is not essential. As a result, in order to
improve triple store query efficiency, MESUR stores such data
in a relational database, and the MESUR ontology does not
explicitly represent these literals.

The principles espoused by the OntologyX ontology are
inspiring. OntologyX uses context classes as the "glue" for
relating other classes, an approach that was adopted for the
MESUR ontology. For instance, the MESUR ontology does
not have a direct relationship between an article and its
publishing journal. Instead, there exists a publishing context
that serves as an N-ary operator uniting a journal, the
article, its publication date, its authors, and auxiliary
information such as the source of the bibliographic data. The
context construct is intuitive and allows for future extensions
to the ontology. OntologyX also helped to determine
the primary abstract classes for the MESUR ontology.
Unfortunately, OntologyX is a proprietary ontology for which
very limited public information is available, making direct
adoption unfeasible for MESUR. As a matter of fact, all
inspiration was derived from a single PowerPoint presentation
from the 2005 FRBR Workshop [20].

Finally, in the realm of usage data representation, no
ontology-based efforts were found. Nevertheless, the following
existing schema-driven approaches were explored and
served as inspiration: the OpenURL ContextObject approach
to facilitate OAI-PMH-based harvesting of scholarly usage
events [6], the XML Log standard to represent digital library
logs [21], and the COUNTER schema to express journal level
usage statistics [22].



5. LEVERAGING RELATIONAL DATABASE
TECHNOLOGY

The MESUR project makes use of a triple store to represent
and access its collected data. While the triple store
is still a maturing technology, it provides many advantages
over the relational database model. For one, the network-based
representation supports the use of network analysis
algorithms. For the purposes of the MESUR project, a
network-based approach to data analysis will play a major
role in quantifying the value of the scholarly artifacts
contained within it. Other benefits that are found with triple
store technologies that are not easily reproducible within
the relational database framework include ease of schema
extension and ontological inferencing.

A novel contribution of the presented ontology is its
solution to the problem of scalability found in modern triple
store technologies [23]. While semantic networks provide a
flexible medium for representing and searching knowledge,
current triple store applications do not support the amount
of data that can be represented at the upper limit of what
is possible with modern relational database technologies.
Therefore, it was necessary to be selective of what
information is actually modeled by the MESUR ontology. For
the MESUR project, much of the data associated with each
scholarly artifact is maintained outside the triple store in a
relational database.

The typical bibliographic record contains, for example,
an article's identifiers (e.g. DOI, SICI, etc.), authors, title,
journal/conference/book, volume, issue, number, and page
numbers. Typical usage information contains, for example,
the user's identifier (e.g. IP address), the time of the usage
event, and a session identifier. Examples of the various
bibliographic and usage properties are outlined in Table 1
and Table 2, respectively. Note that the connection
between the bibliographic record and the usage event
occurs through the doc_id (the last row of each table). The
doc_id is an internally generated identifier created during
the MESUR project's ingestion process.

property      value
----------    ----------------------------------------
title         The Convergence of Digital Libraries ...
author(s)     Rodriguez, Bollen, Van de Sompel
collection    Journal of Information Science
publisher     Sage Publications
date          2006
start page    149
end page      159
volume        32
issue         2
doi           10.1177/0165551506062327
doc_id        b5elab73-26b5-41f0-a83f-b47b4d737

Table 1: Example bibliographic properties

property      value
----------    ----------------------------------------
event_id      45563ac2-c7d4-1669-ab!c-ac512')535oc5
time          2006-09-27 00:00:03
agent         4AD2FD457EB59CE08AAAF6EA2A63F
session       C3044206
affiliation   California State University, Los Angeles
doc_id        b5elab73-26b5-41f0-a83f-b47b4d737

Table 2: Example usage properties

5 OntologyX available at: http://www.ontologyx.com/

The two tables demonstrate how bibliographic and usage
data can be easily represented in a relational database. From
the relational database representation, an RDF N-Triple
(see footnote 6) data file can be generated. One such solution
for this relational database to triple store mapping is the
D2R mapper [24]. However, note that not all data in the
relational database is exported to this intermediate format.
Instead, only those properties that promote triple store
scalability and usage research are included. Thus, article
titles, journal issues

6 N-Triple available at:
http://www.w3.org/2001/sw/RDFCore/ntriples/



and volumes, names of authors, to name a few, are not
explicitly represented within the triple store and thus, are not
modeled by the ontology. If a particular artifact property
that is not in the ontology is required for a computation,
the computing algorithm references the relational database
holding the complete representation of the acquired data. For
example, bi-directional resolution of the artifact with doc_id
2 is depicted in Figure 4, where the resolving identifier is
specific to the artifact (for the sake of diagram readability,
assume that 2 is b5elab73-26b5-41f0-a83f-b47b4d737 from
Tables 1 and 2). This model is counter to what is seen in
other scholarly ontologies such as the Ingenta ontology [19].
This design choice was a major factor that prompted the
engineering of a new ontology for bibliographic and usage
modeling.

[Figure 4 shows a relational database table (columns: doi,
title; rows for artifacts 1 and 2) linked to the corresponding
nodes in the triple store.]

Figure 4: The relationship between the relational
database and the triple store
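
The division of labor between the two stores can be sketched in a few lines of Python using the standard sqlite3 module. This is a minimal illustration under stated assumptions, not the MESUR implementation: the table layout, the urn:example: identifiers, the ex: property prefix, and the function names are all hypothetical. Only the relationships needed for usage analysis are exported as triples, and a title is resolved on demand through the shared doc_id.

```python
import sqlite3


def build_db():
    """Create a hypothetical relational store holding the complete records."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE bib (doc_id TEXT PRIMARY KEY, title TEXT, doi TEXT)")
    db.execute(
        "CREATE TABLE usage_event "
        "(event_id TEXT PRIMARY KEY, time TEXT, agent TEXT, doc_id TEXT)"
    )
    db.execute(
        "INSERT INTO bib VALUES (?, ?, ?)",
        ("2", "The Convergence of Digital Libraries ...", "10.1177/0165551506062327"),
    )
    db.execute(
        "INSERT INTO usage_event VALUES (?, ?, ?, ?)",
        ("e1", "2006-09-27 00:00:03", "agent-1", "2"),
    )
    return db


def export_triples(db):
    """Export only usage relationships; titles and DOIs stay in the database."""
    triples = []
    rows = db.execute("SELECT event_id, agent, doc_id FROM usage_event ORDER BY event_id")
    for event_id, agent, doc_id in rows:
        ctx = f"urn:example:event:{event_id}"
        triples.append((ctx, "ex:hasUser", f"urn:example:agent:{agent}"))
        triples.append((ctx, "ex:hasDocument", f"urn:example:doc:{doc_id}"))
    return triples


def resolve_title(db, doc_uri):
    """Resolve a triple store node back to its relational record via the doc_id."""
    doc_id = doc_uri.rsplit(":", 1)[1]
    row = db.execute("SELECT title FROM bib WHERE doc_id = ?", (doc_id,)).fetchone()
    return row[0] if row else None


db = build_db()
for triple in export_triples(db):
    print(triple)
print(resolve_title(db, "urn:example:doc:2"))
```

Because the document node carries its doc_id, an analysis can run entirely on the smaller graph and consult the relational database only when a literal such as a title is needed for presentation.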



6. THE MESUR ONTOLOGY 

The MESUR ontology is currently at version 2007-01 at
http://www.mesur.org/schemas/2007-01/mesur (abbreviated
mesur). Full HTML documentation of the ontology
can be found at the namespace URI. The following sections
will describe how bibliographic and usage data are modeled
to meet the requirements of understanding large-scale
usage behavior, while at the same time promoting scalability.

6.1 The Primary Classes 

The most general class in OWL is owl:Thing. The
MESUR ontology provides three subclasses of owl:Thing.
These MESUR classes are mesur:Agent, mesur:Document,
and mesur:Context (see footnote 7). This is represented in
Figure 5, where an edge denotes an rdfs:subClassOf relationship.



Figure 5: The primary classes of the MESUR ontology

The Context classes serve as the "glue" by which Agents
and Documents interact. A Context is analogous to rdf:Bag
in that it is an N-ary operator unifying the literals and
objects pointed to by its respective properties. All
relationships between Agents and Documents occur through
a particular Context. However, as will be demonstrated,
direct relationships can be inferred. All inferred properties
are denoted by the "(i)" notation in the following UML class
diagrams. All inferred properties are superfluous
relationships since there is no loss of information by excluding
their instantiation (the information is contained in other
relationships). The algorithms for inferring them will be
discussed in their respective Context subsection.

Currently, all the MESUR classes are specializations or
generalizations of other classes. No holonymy/meronymy
(composite) class definitions are used at this stage of the
ontology's development. Figure 6 presents the complete
taxonomy of the MESUR ontology. This diagram primarily serves
as a reference. Each class will be discussed in the following
sections.




7 For the remainder of this article, all classes that are not 
explicitly namespaced are from the mesur namespace. 



Figure 6: MESUR taxonomy 

6.2 The Agent Classes 

The Agent taxonomy is diagrammed in Figure 7. An
Agent can either be a Human or an Organization. A
Human is an actual individual whether that individual can
be uniquely identified (e.g. a document author) or not
(e.g. a document user). The authored property is an
inferred relationship and denotes that an Agent authored a
particular Document and the published property denotes
that an Agent has published a Document. The authored
and published properties can be inferred from information
within the Publishes context discussed later. Similarly,
the used property denotes that an Agent has used a
particular Document. The used property can be inferred from
the Uses context.

An Organization is a class that is used for both bib- 
liographic and usage provenance purposes. Given that bib- 
liographic and usage data, at the large-scale, must be har- 
vested from multiple institutions, it is necessary to make 
a distinction between the various data providers. In many 
cases, an Organization can be both a bibliographic (e.g. a 
publisher) and a usage (e.g. a repository) provider. Further- 
more, an Organization can also be an author's academic 
institution (e.g. a university). 

Finally, all Agents can have any number of affiliations.
For an Organization, this is a recursive definition which
allows an Organization to have many affiliate Organizations
while at the same time allowing for the Human leaf nodes of
an Organization to be represented by the same construct.
The rules governing the inference of the hasAffiliation
and hasAffiliate properties are discussed in the section
describing the Affiliation context.



Agent
    hasAffiliation: Organization [0..*] (i)
    authored: Document [0..*] (i)
    used: Document [0..*] (i)
    published: Group [0..*] (i)

Organization
    hasAffiliate: Agent [0..*] (i)

Figure 7: Classes of Agent and their properties

6.3 The Document Classes 

A Document is an abstract concept of a particular scholarly
product such as those depicted in Figure 8.



Document
    authoredBy: Agent [0..*] (i)
    usedBy: Agent [0..*] (i)
    publishedBy: Agent [0..1] (i)

Article
    containedin: Group [0..1] (i)

Group
    contains: Article [0..*] (i)

EditedBook
    contains: BookArticle [0..*] (i)

ProceedingsEdition
    partOf: Proceedings [1]
    hasIssue: xsd:int [0..1]

JournalEdition
    partOf: Journal [1]
    hasIssue: xsd:int [0..1]
    hasVolume: xsd:int [0..1]

Other classes shown: Book, ConferenceArticle, JournalArticle,
PreprintArticle, Proceedings

Figure 8: Classes of Document and their properties



In general, Document objects are those artifacts that are
written, used, and published by Agents. Thus, a Document
can be a specific article, a book, or some grouping such as
a Journal, conference Proceedings, or an EditedBook.
There are two Document subclasses to denote whether the
Document is a collection (Group) or an individually written
work (Unit). A Journal or Proceedings is an abstract
concept of a collection of volumes/issues. An edition
of a proceedings or journal is associated with its abstract
Group by the partOf property. The authoredBy,
containedin, publishedBy, and contains properties
can be inferred from the Publishes context. Also, the
usedBy property can be inferred from the Uses context.

6.4 The Context Classes 

As previously stated, all properties from the Agent and
Document classes that are marked by the "(i)" notation are
inferred properties. These properties can be automatically
generated by inference algorithms and thus, are not required
for insertion into the triple store. What this means is that
inherent in the triple store is the data necessary to infer
such relationships. Whether these inferred properties are
included depends on time (e.g. query complexity) and space
(e.g. disk space allocation) constraints. At
any time, these properties can be inserted or removed from
the triple store. The various inferred properties are
determined from their respective Context objects. Therefore,
the MESUR owl:ObjectProperty taxonomy provides
two types of object properties: ContextProperty
and InferredProperty (see Figure 9).



rdf:Property
    owl:ObjectProperty
        ContextProperty
        InferredProperty
    owl:DatatypeProperty

Figure 9: The abstract MESUR property classes

A Context class is an N-ary operator much like an rdf:Bag.
Current triple store technology expresses ternary
relationships. That means that only three resources are related
by a semantic network edge (i.e. a subject URI, predicate
URI, and object URI). However, many real-world
relationships are the product of multiple interacting objects. It is
the role of the various Context classes to provide
relationships for more than three URIs. The Context classes are
represented in Figure 10.
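
To make the N-ary role of a Context concrete, the following minimal Python sketch expands one Publishes context into a set of triples that all share the context node as their subject. The property names follow Figure 10; the ex: identifiers, the mesur: prefix notation, and the function name publishes_context are assumptions made for illustration.

```python
def publishes_context(ctx, unit, group, authors, publisher, time):
    """Expand one N-ary Publishes context into triples about a single context node."""
    triples = [
        (ctx, "rdf:type", "mesur:Publishes"),
        (ctx, "mesur:hasUnit", unit),
        (ctx, "mesur:hasGroup", group),
        (ctx, "mesur:hasPublisher", publisher),
        (ctx, "mesur:hasTime", time),
    ]
    # One hasAuthor triple per author, since the cardinality is [1..*].
    triples += [(ctx, "mesur:hasAuthor", author) for author in authors]
    return triples


# Hypothetical identifiers for a two-author journal article.
example = publishes_context(
    ctx="ex:publishes1",
    unit="ex:doc2",
    group="ex:journal1",
    authors=["ex:author1", "ex:author2"],
    publisher="ex:publisher1",
    time="2006-01-01T00:00:00",
)
for triple in example:
    print(triple)
```

Each triple remains a simple subject, predicate, object statement, yet together they relate six resources and one literal through the shared context node.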



[Class labels for Event, State, Affiliation, and Metric were
lost in extraction and are inferred here from the text and
property names; "[...]" marks a cardinality that was cut off.]

Event
    hasTime: xsd:datetime [1]
    hasProvider: Agent [0..1]

Uses
    hasUser: Agent [1]
    hasAccess: xsd:string [0..1]
    hasSession: xsd:string [0..1]
    hasDocument: Document [1]

Publishes
    hasGroup: Group [0..1]
    hasUnit: Unit [0..1]
    hasAuthor: Agent [1..*]
    hasPublisher: Agent [0..1]

State
    hasStartTime: xsd:datetime [0..1]
    hasEndTime: xsd:datetime [0..1]

WeightedRelationship
    hasSink: Agent or Document [...]
    hasSource: Agent or Document [1]
    hasWeight: xsd:float [0..1]

Citation

Coauthor (i)

Affiliation
    hasAffiliator: Organization [...]
    hasAffiliatee: Agent [1]

Metric
    hasSpec: xsd:string [0..1]
    hasObject: Agent or Document [1]

NumericMetric
    hasNumericValue: xsd:float [1]

NominalMetric
    hasNominalValue: xsd:string [1]

Figure 10: Classes of Context and their properties

The Context class has two subclasses: Event and State.
An Event is some measurement done by some provider at
a particular point in time. For example, the Publishes
and Uses events are recorded by publishers and repositories
at some point in time. As a side note, the hasProvider
property of the Event class is an efficient model for the
representation of provenance constructs. Instead of reifying
every statement with provenance data (e.g. triple x was
supplied by provider y), a single triple is provided for each
Event (e.g. event x was supplied by provider y).
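A rough back-of-the-envelope comparison makes the saving concrete. The sketch below assumes standard RDF reification, which describes one statement with four triples (rdf:type rdf:Statement, rdf:subject, rdf:predicate, rdf:object) plus one triple naming the provider. The event count and the number of statements per event are hypothetical and are not MESUR figures.

```python
# Standard RDF reification describes one statement with four triples
# (rdf:type rdf:Statement, rdf:subject, rdf:predicate, rdf:object); attaching
# a provider to the reified statement costs one more.
TRIPLES_PER_REIFIED_STATEMENT = 4 + 1


def provenance_triples_reified(statements_per_event, num_events):
    """Extra triples if every statement of every event is reified."""
    return statements_per_event * num_events * TRIPLES_PER_REIFIED_STATEMENT


def provenance_triples_event(num_events):
    """Extra triples if each Event carries a single hasProvider triple."""
    return num_events


# Hypothetical load: 1,000 Uses events, each with four statements
# (rdf:type, hasUser, hasDocument, hasTime).
reified = provenance_triples_reified(statements_per_event=4, num_events=1000)
per_event = provenance_triples_event(num_events=1000)
print(reified, per_event)
```

Under these assumptions the per-event model needs one provenance triple where reification would need twenty.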

On the other side of the Context taxonomy are the State
contexts. A State is some measurement that can, in some
cases, occur over a span of time. States are used to represent
complex relationships between artifacts or as a way of
attaching high-level properties (i.e. metadata) to an artifact.
The next sections will provide a detailed description of each
Context class along with SPARQL queries for inferring all
the aforementioned InferredProperty properties.

6.4.1 The Publishes Context 

A Publishes event states, in words, that a particular
bibliographic data provider has acknowledged that a set
of authors have authored a unit that was published in a
group by some publisher at a particular point in time. A
Publishes object relates a single bibliographic data provider,
Agent authors, a Unit, an Agent publisher, a Group, and
a publication ISO-8601 date time literal (footnote 8). Figure 11
represents a Publishes context and the inferable properties
(dashed edges) of the various associated artifacts. All
inferred properties have a respective inverse relationship. Note
that both PreprintArticle and Book publishing are
represented with OWL restrictions (i.e. they are not published
in a Group). The details of these restrictions can be found
in the actual ontology definition.




Figure 11: Example Publishes Context 

The dashed edges in Figure 11 denote properties that are
a rdfs:subClassOf the InferredProperty. For instance,
the abstract triple (Author, authors, Document)
is inferred given the results of the following SPARQL query,
where, for the sake of brevity, the PREFIX declarations are
removed and the INSERT statement represents the insert of
its triple argument into the triple store (footnote 7).



SELECT ?a ?b
WHERE
  ( ?x rdf:type mesur:Publishes )
  ( ?x mesur:hasUnit ?a )
  ( ?x mesur:hasAuthor ?b )

INSERT < ?a mesur:authoredBy ?b >
INSERT < ?b mesur:authored ?a > .

Footnote 8: ISO-8601 is available at http://www.w3.org/TR/NOTE-datetime/

To infer the Group property contains and the Unit property
containedIn, the following SPARQL query and INSERT
statements suffice.

SELECT ?a ?b
WHERE
  ( ?x rdf:type mesur:Publishes )
  ( ?x mesur:hasUnit ?a )
  ( ?x mesur:hasGroup ?b )

INSERT < ?a mesur:containedIn ?b >
INSERT < ?b mesur:contains ?a > .

Finally, the published and publishedBy properties 
are inferred by: 



SELECT ?a ?b
WHERE
  ( ?x rdf:type mesur:Publishes )
  ( ?x mesur:hasPublisher ?a )
  ( ?x mesur:hasGroup ?b )

INSERT < ?a mesur:published ?b >
INSERT < ?b mesur:publishedBy ?a > .
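The three Publishes inference rules can be mimicked outside a triple store. The following is a minimal Python sketch, not the MESUR implementation: the store is an in-memory set of tuples, the function names are invented for illustration, and the "ex:" URIs are hypothetical.

```python
def objects(store, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in store if s == subject and p == predicate}


def infer_from_publishes(store):
    """Derive the inverse property pairs from every Publishes context."""
    inferred = set()
    events = {s for s, p, o in store if p == "rdf:type" and o == "mesur:Publishes"}
    for x in events:
        units = objects(store, x, "mesur:hasUnit")
        authors = objects(store, x, "mesur:hasAuthor")
        groups = objects(store, x, "mesur:hasGroup")
        publishers = objects(store, x, "mesur:hasPublisher")
        for unit in units:
            for author in authors:
                inferred.add((unit, "mesur:authoredBy", author))
                inferred.add((author, "mesur:authored", unit))
            for group in groups:
                inferred.add((unit, "mesur:containedIn", group))
                inferred.add((group, "mesur:contains", unit))
        for publisher in publishers:
            for group in groups:
                inferred.add((publisher, "mesur:published", group))
                inferred.add((group, "mesur:publishedBy", publisher))
    return inferred


# Hypothetical example data ("ex:" URIs are placeholders).
store = {
    ("ex:e1", "rdf:type", "mesur:Publishes"),
    ("ex:e1", "mesur:hasUnit", "ex:article1"),
    ("ex:e1", "mesur:hasAuthor", "ex:alice"),
    ("ex:e1", "mesur:hasAuthor", "ex:bob"),
    ("ex:e1", "mesur:hasGroup", "ex:issue1"),
    ("ex:e1", "mesur:hasPublisher", "ex:pub1"),
}
inferred = infer_from_publishes(store)
for triple in sorted(inferred):
    print(triple)
```

Each rule emits a property together with its inverse, mirroring the paired INSERT statements of the queries.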

6.4.2 The Uses Context 

The Uses context denotes a single usage event where
an Agent uses a Document at a particular point in time.
The Uses context is diagrammed in Figure 12. Like the
Publishes context, the Uses context is an N-ary construct.
Depending on the usage provider, a session identifier
and an access type may also be recorded. A session identifier
denotes the user's login session. An access type denotes, for
example, whether the used Document had its abstract viewed
or was fully downloaded.

Footnote 7: Please note that the presented SPARQL queries are not
optimized for speed; instead, they are optimized for readability.



Figure 12: Example Uses Context 

The following SPARQL query and INSERT statements 
represent the inference of the usedBy and used inverse 
properties of an Article document and Agent, respec- 
tively. Also, note the last two INSERT statements. These 
statements demonstrate how Group usage information can 
also be inferred. 



SELECT ?a ?b ?c
WHERE
  ( ?x rdf:type mesur:Uses )
  ( ?x mesur:hasDocument ?a )
  ( ?a rdf:type mesur:Article )
  ( ?x mesur:hasUser ?b )
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasUnit ?a )
  ( ?y mesur:hasGroup ?c )

INSERT < ?a mesur:usedBy ?b >
INSERT < ?b mesur:used ?a >
INSERT < ?c mesur:usedBy ?b >
INSERT < ?b mesur:used ?c > .
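The join between Uses and Publishes contexts can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's code: the store is an in-memory set of tuples and the "ex:" URIs are hypothetical. Unlike the query above, which returns no bindings for an Article without a matching Publishes context, this sketch still infers the Article-level properties in that case.

```python
def infer_from_uses(store):
    """Infer usedBy/used for Articles and for the Groups that contain them."""

    def objects(subject, predicate):
        return {o for s, p, o in store if s == subject and p == predicate}

    def instances(cls):
        return {s for s, p, o in store if p == "rdf:type" and o == cls}

    articles = instances("mesur:Article")

    # Map each Unit to the Groups it was published in.
    groups_of = {}
    for y in instances("mesur:Publishes"):
        for unit in objects(y, "mesur:hasUnit"):
            groups_of.setdefault(unit, set()).update(objects(y, "mesur:hasGroup"))

    inferred = set()
    for x in instances("mesur:Uses"):
        for doc in objects(x, "mesur:hasDocument") & articles:
            for user in objects(x, "mesur:hasUser"):
                # The Article itself plus every Group it belongs to.
                for target in {doc} | groups_of.get(doc, set()):
                    inferred.add((target, "mesur:usedBy", user))
                    inferred.add((user, "mesur:used", target))
    return inferred


# Hypothetical example data ("ex:" URIs are placeholders).
store = {
    ("ex:u1", "rdf:type", "mesur:Uses"),
    ("ex:u1", "mesur:hasDocument", "ex:article1"),
    ("ex:u1", "mesur:hasUser", "ex:reader"),
    ("ex:article1", "rdf:type", "mesur:Article"),
    ("ex:e1", "rdf:type", "mesur:Publishes"),
    ("ex:e1", "mesur:hasUnit", "ex:article1"),
    ("ex:e1", "mesur:hasGroup", "ex:issue1"),
}
usage = infer_from_uses(store)
for triple in sorted(usage):
    print(triple)
```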



6.4.3 The Weighted Relationship Context 

In many instances, one artifact is related to another by
a particular semantic. However, in some instances, one
artifact is related to another by a semantic label and a floating
point weight value. Furthermore, that weighted relationship
may have been recorded over some period of time. The
WeightedRelationship state context is used to represent
such relationships.

The Citation state context denotes a weighted citation
and is a rdfs:subClassOf the WeightedRelationship.
For Unit to Unit citation, the weight value is 1.0 (or no
weight property, to reduce the triple store footprint) and
there are no start and end time points. However, for Group
to Group citations, the weight of the Citation represents
how many times a particular Group cites another over some
period of time. Hence, it is necessary to denote the start and
end points of both the source and the sink nodes. Figure 13
diagrams a Citation context. Furthermore, the sink
and source types can be either an Agent or a Document;
thus, Organization to Organization citations can be
represented.




Figure 13: Example Citation Context 

Given Unit to Unit citations, the Citation weight between
any two Groups can be inferred. The following example
SPARQL query generates the Citation object for
citations from 2007 articles in the Journal of Informetrics
(ISSN: 1751-1577) to 2005-2006 articles in Scientometrics
(ISSN: 0138-9130). Assume that the URIs of the journals
are their ISSNs, the date time is represented as a
year instead of the lengthy ISO-8601 representation, and the
COUNT command is analogous to the SQL COUNT command
(i.e. returns the number of elements returned by the variable
binding).



SELECT ?x
WHERE
  ( ?x rdf:type mesur:Citation )
  ( ?x mesur:hasSource ?a )
  ( ?x mesur:hasSink ?b )
  ( ?a rdf:type mesur:Article )
  ( ?b rdf:type mesur:Article )
  ( ?y rdf:type mesur:Publishes )
  ( ?z rdf:type mesur:Publishes )
  ( ?y mesur:hasTime ?t ) AND ?t = 2007
  ( ?z mesur:hasTime ?u ) AND (?u > 2004 AND ?u < 2007)
  ( ?y mesur:hasUnit ?a )
  ( ?z mesur:hasUnit ?b )
  ( ?y mesur:hasGroup ?c )
  ( ?z mesur:hasGroup ?d )
  ( ?c mesur:partOf urn:issn:1751-1577 )
  ( ?d mesur:partOf urn:issn:0138-9130 )

INSERT < _123 rdf:type mesur:Citation >
INSERT < _123 mesur:hasSource urn:issn:1751-1577 >
INSERT < _123 mesur:hasSink urn:issn:0138-9130 >
INSERT < _123 mesur:hasWeight COUNT(?x) >
INSERT < _123 mesur:hasSourceStartTime 2007 >
INSERT < _123 mesur:hasSourceEndTime 2007 >
INSERT < _123 mesur:hasSinkStartTime 2005 >
INSERT < _123 mesur:hasSinkEndTime 2006 > .
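The aggregation from Unit to Unit citations to a Group to Group weight can be sketched in a few lines of Python. The data structures are assumptions made for illustration (a dictionary from unit to journal and year, and a list of citing/cited pairs), and the "ex:" units and their citations are invented rather than real bibliographic data.

```python
def group_citation_weight(citations, published, source, sink):
    """Count Unit-to-Unit citations between two (journal, years) selections.

    citations: iterable of (citing_unit, cited_unit) pairs
    published: dict mapping unit -> (journal_id, year)
    source, sink: (journal_id, set_of_years) tuples
    """

    def matches(unit, selection):
        journal, years = selection
        record = published.get(unit)
        return record is not None and record[0] == journal and record[1] in years

    return sum(
        1
        for citing, cited in citations
        if matches(citing, source) and matches(cited, sink)
    )


# Hypothetical units; only the two ISSNs come from the example in the text.
published = {
    "ex:a1": ("urn:issn:1751-1577", 2007),
    "ex:a2": ("urn:issn:1751-1577", 2007),
    "ex:b1": ("urn:issn:0138-9130", 2005),
    "ex:b2": ("urn:issn:0138-9130", 2006),
    "ex:b3": ("urn:issn:0138-9130", 2003),
}
citations = [
    ("ex:a1", "ex:b1"),
    ("ex:a1", "ex:b2"),
    ("ex:a2", "ex:b1"),
    ("ex:a2", "ex:b3"),  # sink outside the 2005-2006 window
    ("ex:b2", "ex:b1"),  # source is not in the citing journal
]
weight = group_citation_weight(
    citations,
    published,
    source=("urn:issn:1751-1577", {2007}),
    sink=("urn:issn:0138-9130", {2005, 2006}),
)
print(weight)
```

The returned count plays the role of the hasWeight value of the generated Citation object.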

Figure 14 diagrams the Coauthor weighted relationship
context. The weight value of this relationship denotes the
number of times two authors have coauthored over
some period of time.




Figure 14: Example Coauthor Context 

The following SPARQL query demonstrates how to infer
the weighted Coauthor relationship between the authors
Marko (lanl:marko) and Herbert (lanl:herbertv) over
all time. A time period for coauthorship counting can be
inserted in a fashion similar to the previous Citation
example.

SELECT ?x
WHERE
  ( ?x rdf:type mesur:Publishes )
  ( ?x mesur:hasAuthor lanl:marko )
  ( ?x mesur:hasAuthor lanl:herbertv )

INSERT < _123 rdf:type mesur:Coauthor >
INSERT < _123 mesur:hasSource lanl:marko >
INSERT < _123 mesur:hasSink lanl:herbertv >
INSERT < _123 mesur:hasWeight COUNT(?x) >
INSERT < _456 rdf:type mesur:Coauthor >
INSERT < _456 mesur:hasSource lanl:herbertv >
INSERT < _456 mesur:hasSink lanl:marko >
INSERT < _456 mesur:hasWeight COUNT(?x) > .
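The same counting can be generalized from one author pair to all pairs. The sketch below is a hypothetical illustration, not the MESUR implementation: each Publishes context is reduced to its author list, the event identifiers and the third author are invented, and a weight is stored in both directions to match the two Coauthor objects created above.

```python
from collections import Counter
from itertools import combinations


def coauthor_weights(publishes):
    """Count, for each ordered author pair, the Publishes events they share.

    publishes: dict mapping a Publishes event id -> list of author URIs
    """
    weights = Counter()
    for authors in publishes.values():
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1  # a as source, b as sink
            weights[(b, a)] += 1  # the symmetric relationship
    return weights


# Hypothetical events; only the two lanl: author URIs come from the text.
publishes = {
    "ex:p1": ["lanl:marko", "lanl:herbertv"],
    "ex:p2": ["lanl:marko", "lanl:herbertv", "ex:other"],
    "ex:p3": ["lanl:marko"],
}
weights = coauthor_weights(publishes)
print(weights[("lanl:marko", "lanl:herbertv")])
```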

6.4.4 The Affiliation Context 

An Affiliation context denotes that a particular Human
is affiliated with an Organization or that an Organization
is affiliated with another Organization. An Affiliation
can be represented as occurring over a particular period of
time. An example of an Affiliation state context is
diagrammed in Figure 15.





[Figure 15 diagram: an Affiliation context with a hasAffiliatee edge to an Agent and the date time literal 2006-11-30T17:06:00-07:00]



Figure 15: Example Affiliation Context 

The hasAffiliate and hasAffiliation properties of
the Agent classes can be inferred by the following SPARQL
query.

SELECT ?a ?b
WHERE
  ( ?x rdf:type mesur:Affiliation )
  ( ?x mesur:hasAffiliator ?a )
  ( ?x mesur:hasAffiliatee ?b )

INSERT < ?a mesur:hasAffiliate ?b >
INSERT < ?b mesur:hasAffiliation ?a > .

6.4.5 The Metric Context 

The primary objective of the MESUR project is to study
the relationship between usage-based value metrics (e.g. the
Usage Impact Factor [5]) and citation-based value metrics
(e.g. the ISI Impact Factor [15] and the Y-Factor [25]). The
Metric context allows for the explicit representation of such
metrics. The Metric context has both the NumericMetric
and NominalMetric subclasses. Figure 16 diagrams the
2007 ImpactFactor numeric metric context for a Group.
Note that the Context hierarchy in Figure 10 does not
represent the set of Metrics explored by the MESUR project.
This taxonomy will be presented in a future publication.




[Figure 16 diagram: a metric context with a hasObject edge to a Group, hasSpec "ISI provided", hasNumericValue 1.78, hasStartTime 2007-01-01T00:00:00-00:00 and hasEndTime 2007-12-30T00:00:00-00:00]



Figure 16: Example Impact Factor Context 

The example SPARQL query and respective INSERT
statements demonstrate how to calculate the 2007 Impact Factor
for the Proceedings of the Joint Conference on Digital
Libraries (JCDL ISSN: 1082-9873). The 2007 Impact Factor
for the JCDL is defined as the number of citations from any
Unit published in 2007 to articles in the JCDL proceedings
published in either 2005 or 2006 normalized by the total
number of articles published by JCDL in 2005 and 2006
[15].



SELECT ?x
WHERE
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasUnit ?a )
  ( ?y mesur:hasGroup ?b )
  ( ?b mesur:partOf urn:issn:1082-9873 )
  ( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)
  ( ?x rdf:type mesur:Citation )
  ( ?x mesur:hasSource ?c )
  ( ?x mesur:hasSink ?a )
  ( ?z rdf:type mesur:Publishes )
  ( ?z mesur:hasUnit ?c )
  ( ?z mesur:hasTime ?u ) AND ?u = 2007

SELECT ?y
WHERE
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasGroup ?a )
  ( ?a mesur:partOf urn:issn:1082-9873 )
  ( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)

INSERT < _123 rdf:type mesur:ImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
INSERT < _123 mesur:hasStartTime 2007 >
INSERT < _123 mesur:hasEndTime 2007 >
INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) > .
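The two-query calculation reduces to one ratio, which the following Python sketch reproduces on toy data. It is an illustration under stated assumptions: the function name, the dictionary and list structures, and the "ex:" units are hypothetical, and the denominator counts distinct units rather than Publishes events.

```python
def impact_factor(citations, published, journal, year):
    """Citations made in `year` to the journal's articles from the two
    preceding years, divided by the number of those articles."""
    window = {year - 2, year - 1}
    cited_pool = {
        unit for unit, (j, y) in published.items() if j == journal and y in window
    }
    if not cited_pool:
        raise ValueError("journal published no articles in the two-year window")
    hits = sum(
        1
        for citing, cited in citations
        if cited in cited_pool
        and citing in published
        and published[citing][1] == year
    )
    return hits / len(cited_pool)


JCDL = "urn:issn:1082-9873"
# Hypothetical units: unit -> (journal, publication year).
published = {
    "ex:j1": (JCDL, 2005),
    "ex:j2": (JCDL, 2005),
    "ex:j3": (JCDL, 2006),
    "ex:j4": (JCDL, 2004),  # outside the 2005-2006 window
    "ex:j5": (JCDL, 2006),
    "ex:c1": ("ex:otherJournal", 2007),
    "ex:c2": ("ex:otherJournal", 2007),
    "ex:c3": ("ex:otherJournal", 2006),
}
citations = [
    ("ex:c1", "ex:j1"),
    ("ex:c1", "ex:j3"),
    ("ex:c2", "ex:j1"),
    ("ex:c3", "ex:j1"),  # citing unit is from 2006, not counted
    ("ex:c2", "ex:j4"),  # cited unit is from 2004, not counted
]
value = impact_factor(citations, published, JCDL, 2007)
print(value)
```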

The 2007 Usage Impact Factor for the JCDL Proceedings
can be calculated by using the following SPARQL queries
and INSERT commands. The 2007 Usage Impact Factor for
the JCDL is defined as the number of usage events in 2007
that pertain to articles published in the JCDL proceedings
in either 2005 or 2006 normalized by the total number of
articles published by the JCDL in 2005 and 2006 [5].

SELECT ?x
WHERE
  ( ?x rdf:type mesur:Uses )
  ( ?x mesur:hasDocument ?a )
  ( ?x mesur:hasTime ?t ) AND ?t = 2007
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasUnit ?a )
  ( ?y mesur:hasGroup ?c )
  ( ?c mesur:partOf urn:issn:1082-9873 )
  ( ?y mesur:hasTime ?u ) AND (?u > 2004 AND ?u < 2007)

SELECT ?y
WHERE
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasGroup ?a )
  ( ?a mesur:partOf urn:issn:1082-9873 )
  ( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)

INSERT < _123 rdf:type mesur:UsageImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) > .
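The definition that these queries encode can also be written as a single ratio. The notation is introduced here for illustration and does not appear in the source: U is the set of Uses events, t(u) is the hasTime value of event u, d(u) is its hasDocument value, and A_J is the set of articles published in journal J in 2005 or 2006.

```latex
\mathrm{UIF}_{2007}(J) \;=\;
\frac{\bigl|\{\, u \in U \;:\; t(u) = 2007,\ d(u) \in A_J \,\}\bigr|}{\lvert A_J \rvert},
\qquad
A_J = \{\, a \;:\; a \text{ published in } J \text{ in 2005 or 2006} \,\}
```

The numerator corresponds to COUNT(?x) in the first query and the denominator to COUNT(?y) in the second.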

As demonstrated, the presented metrics can be easily
calculated using simple SPARQL queries. However, more
complex metrics, such as those that are recursive in definition,
can be computed using other semantic network algorithms.
For example, the eigenvector-based Y-Factor [25] can be
computed in semantic networks using the grammar-based
random walker framework presented in [26]. The objective
of the MESUR project is to understand the space of
such metrics and their application to valuing artifacts in the
scholarly community. Future work in this area will report
the findings that are derived from such algorithms.

7. CONCLUSION 

This article presented the MESUR ontology, which has
been engineered to provide an integrated model of the
bibliographic, citation, and usage aspects of the scholarly
community. The ontology focuses only on that information for
which large-scale real world data exists, that supports usage
research, and whose instantiation is scalable to an estimated
50 million articles and 1 billion usage events. A novel
approach to data representation was defined that leverages
both relational database and triple store technology. The
MESUR project was started in October of 2006 and is thus
still in its early stages of development. While a trim
ontology has been presented, the effects of this ontology on
load and query times are still inconclusive. Future work will
present benchmark results of the MESUR triple store.

8. ACKNOWLEDGMENTS 

This research is supported by a grant from the Andrew 
W. Mellon Foundation. 

9. REFERENCES 

[1] M. J. Kurtz, G. Eichhorn, A. Accomazzi, C. S. Grant, 
M. Demleitner, and S. S. Murray, "The bibliometric 
properties of article readership information," Journal 
of the American Society for Information Science and 
Technology, vol. 56, no. 2, pp. 111-128, 2005. 

[2] T. Brody, S. Harnad, and L. Carr, "Earlier web usage
statistics as predictors of later citation impact,"
Journal of the American Society for Information
Science and Technology, vol. 57, no. 8, pp. 1060-1072,
2006.

[3] J. Bollen and H. Van de Sompel, "Mapping the 
structure of science through usage," Scientometrics, 
vol. 69, no. 2, 2006. 

[4] J. Bollen, H. Van de Sompel, J. Smith, and R. Luce,
"Toward alternative metrics of journal impact: a
comparison of download and citation data,"
Information Processing and Management, vol. 41,
no. 6, pp. 1419-1440, 2005. [Online]. Available:
http://www.arxiv.org/pdf/cs.DL/0503007

[5] J. Bollen and H. Van de Sompel, "Usage impact
factor: the effects of sample characteristics on
usage-based impact metrics," Los Alamos National
Laboratory, Tech. Rep., 2006. [Online]. Available:
http://arxiv.org/abs/cs/0610154

[6] J. Bollen and H. Van de Sompel, "An architecture for
the aggregation and analysis of scholarly usage data,"
in Joint Conference on Digital Libraries (JCDL06),
Chapel Hill, NC, June 2006, pp. 298-307.

[7] D. L. McGuinness and F. van Harmelen, "OWL web
ontology language overview," February 2004. [Online].
Available: http://www.w3.org/TR/owl-features/

[8] J. F. Sowa, Ed., Principles of Semantic Networks: 
Explorations in the Representation of Knowledge. 
San Mateo, CA: Morgan Kaufmann, 1991. 

[9] H. P. Alesso and C. F. Smith, Developing Semantic
Web Services. Wellesley, MA: A.K. Peters LTD, 2005.

[10] F. Manola and E. Miller, "RDF primer: W3C
recommendation," February 2004. [Online]. Available:
http://www.w3.org/TR/rdf-primer/

[11] T. Berners-Lee, R. Fielding, and L. Masinter,
"Uniform Resource Identifier (URI): Generic Syntax,"
January 2005.

[12] N. F. Noy, W. Grosso, and M. A. Musen, "The 
knowledge model of Protege-2000: Combining 
interoperability and flexibility," in International 
Conference on Knowledge Engineering and Knowledge 
Management, Juan-les-Pins, France, 2000. 

[13] A. Magkanaraki, G. Karvounarakis, T. T. Anh,
V. Christophides, and D. Plexousakis, "Ontology
storage and querying," École Nationale Supérieure des
Télécommunications, Tech. Rep., April 2002. [Online].
Available:
http://139.91.183.30:9090/RDF/publications/tr308.pdf

[14] E. Prud'hommeaux and A. Seaborne, "SPARQL
query language for RDF," World Wide Web
Consortium, Tech. Rep., October 2004. [Online].
Available:
http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/

[15] E. Garfield, "Journal impact factor: a brief review," 
Canadian Medical Association Journal, vol. 161, pp. 
979-980, 1999. 

[16] S. Staab, J. Angele, S. Decker, M. Erdmann,
A. Hotho, A. Maedche, H. P. Schnurr, R. Studer, and
Y. Sure, "Semantic community web portals," in 9th
International World Wide Web Conference,
Amsterdam, Netherlands, May 2000. [Online].
Available: http://www9.org/w9cdrom/134/134.html

[17] S. B. Shum, E. Motta, and J. Domingue, "Scholonto:
an ontology-based digital library server for research
documents and discourse," International Journal on
Digital Libraries, vol. 3, no. 3, pp. 237-248, 2000.
[Online]. Available:
citeseer.ist.psu.edu/shum00scholonto.html

[18] C. Lagoze and J. Hunter, "The ABC ontology and
model," Journal of Digital Information, vol. 2, no. 2,
2001.

[19] K. Portwin and P. Parvatikar, "Building and 
managing a massive triple store: An experience 
report," in XTech: Building Web 2.0, Amsterdam, 
Netherlands, 2006. 

[20] G. Rust, "Ontologyx," in Functional Requirements for 
Bibliographic Records Workshop Proceedings, Dublin, 
Ohio, May 2005. 

[21] M. A. Goncalves, M. Luo, R. Shen, M. F. Ali, and
E. A. Fox, "An XML log standard and tool for digital
library logging analysis," in ECDL 2002: LNCS 2458,
M. Agosti and C. Thanos, Eds. Berlin:
Springer-Verlag, September 2002, pp. 129-143.

[22] P. T. Shepherd, "Project COUNTER - Setting
international standards for online usage statistics,"
Journal of Information Processing and Management,
vol. 47, no. 4, pp. 245-257, 2004.

[23] R. Lee, "Scalability report on triple store
applications," Massachusetts Institute of Technology,
Tech. Rep., 2004.

[24] C. Bizer, "D2R - a database to RDF mapping
language," in The Twelfth International World Wide
Web Conference (WWW03), Budapest, Hungary, May
2003.

[25] J. Bollen, M. A. Rodriguez, and H. Van de Sompel,
"Journal status," Scientometrics, vol. 69, no. 3,
December 2006.

[26] M. A. Rodriguez, "Grammar-based random walkers in
semantic networks," Los Alamos National Laboratory,
Tech. Rep. LA-UR-06-7791, 2007. [Online]. Available:
http://www.soe.ucsc.edu/~okram/papers/
random-grammar.pdf



